You can define a bullet list or a numbered list:
Here you can define a formula:
\[ y = \beta_0 + \beta_1X + \epsilon \] A formula can also be reported inline in the text: \(\mu = 1/n \sum X_i\)
data_SD3 <- read.delim("~/RProjects/2024-Q2-R-2 [MDA2024, exercises]/D1_SD3/data_SD3.csv", stringsAsFactors=TRUE)
Forbes2000 <- read.csv("~/RProjects/2024-Q2-R-2 [MDA2024, exercises]/D0_Data/Forbes2000.csv", stringsAsFactors=TRUE)
glass <- read.csv("~/RProjects/2024-Q2-R-2 [MDA2024, exercises]/D0_Data/glass.csv", stringsAsFactors=TRUE)
To add R code to the Notebook we need to use a code chunk.
X <- iris
It is possible to have an overview of the data by using the summary function.
summary(X)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100 setosa :50
1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300 versicolor:50
Median :5.800 Median :3.000 Median :4.350 Median :1.300 virginica :50
Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
In R there are three main types of data:
Y <- as.matrix(X[ ,1:4])
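The list of the three types is not spelled out here; assuming it refers to the vector, the matrix, and the data frame (the structures used throughout this section), a minimal sketch:

```r
# vector: one-dimensional, all elements of the same type
v <- c(5.1, 4.9, 4.7)

# matrix: two-dimensional, all elements of the same type
M <- matrix(1:6, nrow = 2)

# data frame: two-dimensional, columns may have different types
D <- data.frame(sepal = v, species = rep("setosa", 3))

is.vector(v); is.matrix(M); is.data.frame(D)   # all TRUE
```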
To handle data you can use the following code:
X[10, 2] # selection of one element in the Data Frame (or matrix)
[1] 3.1
X[5:20, 1:3] # selection of an interval
X[5:20, ] # the empty space select all the columns or rows
X$Sepal.Length # the symbol $ is used to select a column in the data frame
[1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4 5.1 5.7 5.1 5.4 5.1
[23] 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5 4.9 5.0 5.5 4.9 4.4 5.1 5.0 4.5 4.4 5.0
[45] 5.1 4.8 5.1 4.6 5.3 5.0 7.0 6.4 6.9 5.5 6.5 5.7 6.3 4.9 6.6 5.2 5.0 5.9 6.0 6.1 5.6 6.7
[67] 5.6 5.8 6.2 5.6 5.9 6.1 6.3 6.1 6.4 6.6 6.8 6.7 6.0 5.7 5.5 5.5 5.8 6.0 5.4 6.0 6.7 6.3
[89] 5.6 5.5 5.5 6.1 5.8 5.0 5.6 5.7 5.7 6.2 5.1 5.7 6.3 5.8 7.1 6.3 6.5 7.6 4.9 7.3 6.7 7.2
[111] 6.5 6.4 6.8 5.7 5.8 6.4 6.5 7.7 7.7 6.0 6.9 5.6 7.7 6.3 6.7 7.2 6.2 6.1 6.4 7.2 7.4 7.9
[133] 6.4 6.3 6.1 7.7 6.3 6.4 6.0 6.9 6.7 6.9 5.8 6.8 6.7 6.7 6.3 6.5 6.2 5.9
# boxplot - horizontal True (T) [ with an error ]
# error type: Error in if (horizontal) plot.window(ylim = xlim, xlim = ylim, log = log, : the condition has length > 1
# note: T is not a reserved word; if T has been reassigned (e.g. T <- table(X$Species), as later in these notes), horizontal = T passes a table instead of TRUE, which causes this error. Writing horizontal = TRUE avoids it.
# boxplot(X$Sepal.Length, main = "Box Plot of the Sepal Length of the IRIS Flowers", col = "blue", horizontal = T)
# boxplot - horizontal False (F) [ without error ]
boxplot(X$Sepal.Length, main = "Box Plot of the Sepal Length of the IRIS Flowers", col = "blue", horizontal = F)
The elements of the box-plot are reported below:
# just a boxplot
boxplot(X[ ,1:4])
# boxplot with a title and colors
boxplot(X[ ,1:4], main = "Box Plot of all quantitative variables of IRIS data", col = terrain.colors(4))
boxplot(X$Sepal.Width ~ X$Species, main = "Box Plot of Sepal Width considering the 3 types of flowers", xlab = "Type of IRIS Flowers", ylab = "Sepal Width", col = terrain.colors(4))
The tilde symbol is obtained by:
The bold line is the median, i.e. the value of the ordered distribution that leaves the same number of units above and below it (or to the left and right).
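A minimal sketch of this definition (the values are illustrative):

```r
x <- c(7, 1, 5, 3, 9)   # illustrative values
sort(x)                 # 1 3 5 7 9: the central value is 5
median(x)               # 5
sum(x < median(x))      # 2 units below ...
sum(x > median(x))      # ... and 2 units above
```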
boxplot(X$Sepal.Length, main = "Box-Plot of the Sepal Length", col = "green", horizontal = F)
boxplot(X[ ,1:4], main = "Box-Plot with all the Variables", col = "blue", horizontal = F)
boxplot(X$Sepal.Width ~ X$Species, main = "Box-Plot of Sepal Width for the different types of IRIS Flowers")
A bar plot can be used for qualitative data and for categorized quantitative data. The first step in creating a bar plot is to generate a frequency table.
T <- table(X$Species)
T
setosa versicolor virginica
50 50 50
barplot(T, main = "Bar Plot of Type of flowers", xlab = "Type of flowers", ylab = "Absolute Frequency", col = terrain.colors(4))
The pie chart is also based on the frequency table.
pie(T, main = "Pie Chart", col = terrain.colors(4))
The histogram is a plot used only for quantitative data; it is based on a frequency table in classes. The R function is called hist and the input is a simple distribution of a quantitative variable.
hist(X$Sepal.Length)
hist(X$Sepal.Width, main = "Histogram of Sepal Width", xlab = "Classes", ylab = "Absolute Frequency", col = "lightgreen", border = "blue")
The histogram can be used only for quantitative variables.
hist(X$Sepal.Width, main = "Histogram of the Sepal Width", xlab = "Classes",
ylab = "Absolute Frequency", col = "green", border = "red", breaks = 10)
In case of equally spaced (same-size) classes we can report on the Y axis the absolute frequency or the relative frequency. In case of classes with different sizes we have to report on the Y axis the density of frequency, defined as \(d_i = n_i/h_i\), where \(n_i\) is the absolute frequency and \(h_i\) is the size of the class.
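A small sketch of this formula on the iris data; the breakpoints below are an illustrative choice of classes with different sizes:

```r
x <- iris$Sepal.Width
breaks <- c(2.0, 2.8, 3.2, 4.4)                                  # illustrative, unequal classes
n_i <- as.numeric(table(cut(x, breaks, include.lowest = TRUE)))  # absolute frequencies
h_i <- diff(breaks)                                              # class sizes
d_i <- n_i / h_i                                                 # density of frequency
round(d_i, 1)
```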
plot(X$Sepal.Length, X$Sepal.Width, main = "Correlation Plot", xlab = "Sepal Length", ylab = "Sepal Width")
plot(X$Petal.Length, X$Petal.Width, main = "Correlation Plot",
xlab = "Petal Length", ylab = "Petal Width", col = "blue",
pch = 1)
Plots with IRIS
plot(X$Sepal.Length, X$Sepal.Width, main = "(1-2) Plot with IRIS",
xlab = "Sepal Length", ylab = "Sepal Width", col = "blue")
plot(X$Petal.Length, X$Petal.Width, main = "(2-2) Plot with IRIS",
xlab = "Petal Length", ylab = "Petal Width", col = "red")
pairs(X[ ,1:4])
The plots below the main diagonal are the same as the plots above it, because both the scatter plot and the correlation index are symmetric.
r <- cor(X[ ,1:2])
r <- round(r, 3)
r
Sepal.Length Sepal.Width
Sepal.Length 1.000 -0.118
Sepal.Width -0.118 1.000
cor(X$Sepal.Length, X$Sepal.Width)
[1] -0.1175698
The range of the correlation index is \(-1 \le r \le 1\). The interpretation of the correlation index \(r\) is the following:
The correlation between Sepal Length and Sepal Width is -0.118 and it is a low negative correlation.
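As a check of the definition, \(r\) can also be computed by hand as the covariance divided by the product of the standard deviations:

```r
x <- iris$Sepal.Length
y <- iris$Sepal.Width
r_manual <- cov(x, y) / (sd(x) * sd(y))
round(r_manual, 3)              # -0.118
all.equal(r_manual, cor(x, y))  # TRUE
```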
install.packages("ggplot2")
Error in install.packages : Updating loaded packages
library(ggplot2)
A ggplot call has three main arguments:
# GGPLOT (example no 1)
ggplot(data = X[ ,1:2])
# GGPLOT (example no 2.1)
ggplot(data = X[ ,1:2]) +
geom_point(mapping = aes(X$Sepal.Length, X$Sepal.Width))
# GGPLOT (example no 2.2)
ggplot(data = X[ ,1:2]) +
geom_point(mapping = aes(Sepal.Length, Sepal.Width))
# GGPLOT (example no 2.3)
ggplot(data = X) +
geom_point(mapping = aes(Sepal.Length, Sepal.Width))
# GGPLOT (example no 3)
ggplot(data = X) +
geom_point(mapping = aes(Sepal.Length, Sepal.Width)) +
ggtitle("Scatter Plot") + xlab("Sepal Length") + ylab("Sepal Width")
ggplot(data = X) +
geom_point(mapping = aes(Sepal.Length, Sepal.Width, color = Species))
STANDARD TEMPLATE IS: ggplot(data = <DATA>) + <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))
# First Example
ggplot(data = X) +
geom_point(mapping = aes(Petal.Length, Petal.Width, color = Species)) +
ggtitle("Petal Length and Width") +
xlab("Petal Length") + ylab("Petal Width")
# Second Example
ggplot(data = X, mapping = aes(Petal.Width, Petal.Length)) +
geom_point(mapping = aes(color = Species)) +
ggtitle("Petal Width and Length") +
xlab("Petal Width") + ylab("Petal Length")
ggplot(data = X) +
geom_boxplot(mapping = aes(Sepal.Width), color = "blue", outlier.colour = "red", outlier.shape = 8, outlier.size = 3) +
ggtitle("Box Plot for Sepal Width")
Box Plot taking into account the 3 types of flowers
ggplot(data = X) +
geom_boxplot(mapping = aes(Species, Sepal.Width), outlier.color = "red", outlier.shape = 8)
GGPLOT function as object
p <- ggplot(data = X) +
geom_boxplot(mapping = aes(Species, Sepal.Width, fill = Species))
p
p + theme(legend.position = "bottom")
ggplot(data = X) +
geom_bar(mapping = aes(Species)) +
ggtitle("Bar Plot with GGPLOT") +
ylab("Absolute Frequency")
A data frame with 234 rows and 11 variables:
Y <- mpg
summary(Y)
manufacturer model displ year cyl
Length:234 Length:234 Min. :1.600 Min. :1999 Min. :4.000
Class :character Class :character 1st Qu.:2.400 1st Qu.:1999 1st Qu.:4.000
Mode :character Mode :character Median :3.300 Median :2004 Median :6.000
Mean :3.472 Mean :2004 Mean :5.889
3rd Qu.:4.600 3rd Qu.:2008 3rd Qu.:8.000
Max. :7.000 Max. :2008 Max. :8.000
trans drv cty hwy fl
Length:234 Length:234 Min. : 9.00 Min. :12.00 Length:234
Class :character Class :character 1st Qu.:14.00 1st Qu.:18.00 Class :character
Mode :character Mode :character Median :17.00 Median :24.00 Mode :character
Mean :16.86 Mean :23.44
3rd Qu.:19.00 3rd Qu.:27.00
Max. :35.00 Max. :44.00
class
Length:234
Class :character
Mode :character
head(Y)
table(Y$cyl)
4 5 6 8
81 4 79 70
# first example: factor(cyl) as a color ( vertical bar chart )
ggplot(data = Y) +
geom_bar(mapping = aes(cyl, fill = factor(cyl)))
# second example: class as a color ( vertical bar chart )
ggplot(data = Y) +
geom_bar(mapping = aes(cyl, fill = class))
# third example: class instead of cyl ( vertical bar chart )
ggplot(data = Y) +
geom_bar(mapping = aes(cyl, fill = class))
# fourth example: class instead of cyl ( horizontal bar chart )
ggplot(data = Y) +
geom_bar(mapping = aes(cyl, fill = class)) + coord_flip()
# fifth example: class instead of cyl ( horizontal bar chart & legend at the bottom )
ggplot(data = Y) +
geom_bar(mapping = aes(cyl, fill = class)) + coord_flip() + theme(legend.position = "bottom")
# example no 1.1
ggplot(data = Y) +
geom_histogram(mapping = aes(cty))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# example no 1.2
ggplot(data = Y) +
geom_histogram(mapping = aes(cty, colour = class))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# example no 1.3
ggplot(data = Y) +
geom_histogram(mapping = aes(cty, fill = class))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# example no 1.4
ggplot(data = Y) +
geom_histogram(mapping = aes(cty, fill = "red"))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# example no 1.5
ggplot(data = Y) +
geom_histogram(mapping = aes(cty, fill = factor(cty)))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# example 2
ggplot(data = Y) +
geom_histogram(mapping = aes(hwy))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
With Facet Wrap layer (option) we can create sub-plots based on a categorical variable.
ggplot(data = Y) +
geom_point(mapping = aes(displ, hwy)) +
facet_wrap(~drv)
ggplot(data = Y) +
geom_point(mapping = aes(displ, hwy)) +
facet_wrap(drv ~ class)
ggplot(data = Y) +
geom_smooth(mapping = aes(displ, hwy))
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
# First Example
ggplot(data = Y) +
geom_smooth(mapping = aes(displ, hwy, linetype = class))
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
(R then repeats several simpleLoess/predLoess warnings: "span too small. fewer data values than degrees of freedom", "pseudoinverse used", "neighborhood radius", "reciprocal condition number", "There are other near singularities as well" — some class groups contain too few observations for the default loess span.)
# Second Example
ggplot(data = Y) +
geom_smooth(mapping = aes(displ, hwy, linetype = drv))
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
ggplot(data = Y) +
geom_smooth(mapping = aes(displ, hwy)) +
facet_wrap(~ drv)
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
ggplot(data = Y) +
geom_smooth(mapping = aes(displ, hwy, color = drv))
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
ggplot(data = Y) +
geom_smooth(mapping = aes(displ, hwy, group = drv))
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
ggplot(data = Y) +
geom_point(mapping = aes(displ, hwy, color = drv)) +
geom_smooth(mapping = aes(displ, hwy, color = drv))
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
In a regression model we need to define the dependent and the independent variables. In our case the model is defined as follows: \(hwy = \beta_0 + \beta_1 \, displ + \epsilon\)
In the first place we need to create a scatter plot (correlation plot).
ggplot(data = Y) +
geom_point(mapping = aes(displ, hwy)) +
geom_smooth(method = lm, mapping = aes(displ, hwy))
`geom_smooth()` using formula = 'y ~ x'
ggplot(data = Y) +
geom_point(mapping = aes(displ, hwy, color = drv)) +
geom_smooth(method = lm, mapping = aes(displ, hwy, color = drv))
`geom_smooth()` using formula = 'y ~ x'
res.reg <- lm(hwy ~ displ, data = Y)
summary(res.reg)
Call:
lm(formula = hwy ~ displ, data = Y)
Residuals:
Min 1Q Median 3Q Max
-7.1039 -2.1646 -0.2242 2.0589 15.0105
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 35.6977 0.7204 49.55 <2e-16 ***
displ -3.5306 0.1945 -18.15 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.836 on 232 degrees of freedom
Multiple R-squared: 0.5868, Adjusted R-squared: 0.585
F-statistic: 329.5 on 1 and 232 DF, p-value: < 2.2e-16
The regression coefficients are defined as follows:
The results of the regression model show that the estimated \(\beta_0\) (the intercept) is equal to 35.6977 and the estimated \(\beta_1\) (the regression coefficient/slope) is equal to -3.5306.
The quality of the model is measured by the goodness of fit index \(R^2\), which for this case study is equal to 0.585. The \(R^2\) is interpreted as follows:
In this case study we have a Medium-High Quality of the model.
To test the regression parameters it is possible to use the t-value or the p-value, as reported below:
In this case study the estimated regression coefficient \(\beta_1\) is statistically significant, since the absolute t-value is 18.15 (> 2) and the p-value is below 2e-16 (< 5%). This means that the highway miles per gallon (hwy) depends on the engine displacement (displ): by increasing displ by 1 unit, hwy decreases by 3.5306.
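The estimates themselves come from the least-squares formulas \(\hat\beta_1 = \mathrm{cov}(x, y)/\mathrm{var}(x)\) and \(\hat\beta_0 = \bar y - \hat\beta_1 \bar x\). A sketch with the base dataset mtcars (chosen here only because it needs no extra package):

```r
x <- mtcars$disp
y <- mtcars$mpg
b1 <- cov(x, y) / var(x)       # estimated slope
b0 <- mean(y) - b1 * mean(x)   # estimated intercept
fit <- lm(mpg ~ disp, data = mtcars)
all.equal(unname(coef(fit)), c(b0, b1))   # TRUE
```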
library(FactoMineR)
library(factoextra)
library(ggplot2)
AUTO <- read.csv("~/RProjects/2024-Q2-R-2 [MDA2024, exercises]/D2_AUTO/AUTO.csv", row.names=1, dec=",")
res.pca <- PCA(AUTO, graph = F)
summary.PCA(res.pca)
Call:
PCA(X = AUTO, graph = F)
Eigenvalues
Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7 Dim.8 Dim.9
Variance 3.184 2.032 1.489 1.339 0.985 0.400 0.335 0.132 0.085
% of var. 31.844 20.321 14.892 13.389 9.852 4.000 3.348 1.319 0.849
Cumulative % of var. 31.844 52.166 67.058 80.447 90.299 94.299 97.648 98.967 99.816
Dim.10
Variance 0.018
% of var. 0.184
Cumulative % of var. 100.000
Individuals (the 10 first)
Dist Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3 ctr cos2
A155 | 4.094 | 3.053 19.520 0.556 | -2.037 13.619 0.248 | -0.321 0.461 0.006 |
AU80 | 3.722 | -1.451 4.406 0.152 | -1.774 10.321 0.227 | 2.480 27.541 0.444 |
BMW3 | 4.306 | 3.181 21.190 0.546 | -0.561 1.031 0.017 | 0.016 0.001 0.000 |
CXAN | 3.093 | 1.072 2.405 0.120 | 2.357 18.229 0.581 | -0.602 1.622 0.038 |
TEMP | 1.642 | -0.056 0.007 0.001 | -0.708 1.644 0.186 | 1.190 6.336 0.525 |
MOND | 3.695 | 2.894 17.536 0.614 | 1.762 10.182 0.227 | 0.327 0.477 0.008 |
DELT | 3.818 | 0.085 0.015 0.000 | -1.886 11.669 0.244 | 0.601 1.615 0.025 |
DEDR | 1.375 | 0.151 0.048 0.012 | -0.798 2.087 0.337 | -0.248 0.275 0.032 |
PRIM | 3.431 | -1.195 2.990 0.121 | -0.989 3.208 0.083 | -2.298 23.646 0.449 |
VECT | 2.335 | -0.808 1.366 0.120 | 1.087 3.874 0.217 | -0.303 0.411 0.017 |
Variables
Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3 ctr cos2
prezzo | 0.445 6.205 0.198 | -0.087 0.372 0.008 | 0.379 9.665 0.144 |
cilindrata | 0.659 13.642 0.434 | 0.403 8.009 0.163 | 0.058 0.223 0.003 |
cavalli | 0.849 22.634 0.721 | -0.399 7.853 0.160 | 0.018 0.022 0.000 |
lunghezza | -0.111 0.387 0.012 | 0.542 14.444 0.294 | 0.213 3.046 0.045 |
larghezza | 0.334 3.513 0.112 | 0.742 27.064 0.550 | 0.177 2.113 0.031 |
peso | 0.652 13.357 0.425 | -0.333 5.450 0.111 | 0.284 5.400 0.080 |
velocita | 0.923 26.770 0.852 | 0.001 0.000 0.000 | 0.144 1.389 0.021 |
cons_strada | -0.191 1.149 0.037 | -0.805 31.881 0.648 | 0.425 12.147 0.181 |
cons_urbano | -0.091 0.261 0.008 | 0.315 4.898 0.100 | 0.709 33.791 0.503 |
affidabilita | -0.620 12.079 0.385 | -0.025 0.030 0.001 | 0.693 32.205 0.480 |
The aim of PCA is to create new "Latent Variables" (in FactoMineR language, "Comp", components) as linear combinations of the original ones. The number of LVs \(q\) should be smaller than \(p\), the number of manifest/original variables; in this case study \(p = 10\). To select the number \(q\) of LVs we use the eigenvalues, which represent the level of information extracted (explained variance) from the original data. The rules to select the number of LVs are the following: - eigenvalues greater than 1: \(\lambda > 1\) - cumulative percentage of explained variance \(> 70\%\) (\(> 0.7\)) (around \(70\%\))
In this case, considering the two eigenvalue criteria, with \(\lambda > 1\) we should select four LVs; with the second criterion we select only 3 LVs. In general, for the exam report, you select at most 2 LVs.
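The same rules can be checked without FactoMineR: the eigenvalues of a standardized PCA are the eigenvalues of the correlation matrix. A sketch on the iris measurements (used here as data that is always available):

```r
X <- iris[, 1:4]
ev <- eigen(cor(X))$values            # eigenvalues of the standardized PCA
round(ev, 3)                          # roughly 2.918 0.914 0.147 0.021
sum(ev > 1)                           # rule 1: how many eigenvalues exceed 1
round(100 * cumsum(ev) / sum(ev), 1)  # rule 2: cumulative % of explained variance
```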
round(res.pca$eig, 3)
eigenvalue percentage of variance cumulative percentage of variance
comp 1 3.184 31.844 31.844
comp 2 2.032 20.321 52.166
comp 3 1.489 14.892 67.058
comp 4 1.339 13.389 80.447
comp 5 0.985 9.852 90.299
comp 6 0.400 4.000 94.299
comp 7 0.335 3.348 97.648
comp 8 0.132 1.319 98.967
comp 9 0.085 0.849 99.816
comp 10 0.018 0.184 100.000
The table reported below is called the table of correlations or the table of the coordinates: when the data are standardized, correlations and coordinates have the same values. The table reports the correlations between the Manifest Variables (the original variables) and the LVs (Dimensions in FactoMineR language). The correlations also represent the coordinates of the manifest variables on the Variable Plot.
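This equivalence can be verified with base R's prcomp (an assumption of this sketch: FactoMineR's Dimensions correspond to prcomp's components up to sign). For standardized data, the correlation of a variable with a component equals its coordinate, i.e. the loading times the component's standard deviation:

```r
X <- scale(iris[, 1:4])                 # standardized manifest variables
p <- prcomp(X)
coord <- p$rotation %*% diag(p$sdev)    # variable coordinates
# correlations between each MV and the first component equal its coordinates
all.equal(unname(cor(X, p$x[, 1])[, 1]), unname(coord[, 1]))   # TRUE
```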
For the interpretation:
In this case study the two LVs (Dimensions) are defined as follows: - LV_1 (Dim.1): right side: cilindrata, cavalli, velocita (displacement, horsepower, speed); left side: affidabilita (reliability) - LV_2 (Dim.2): upper side: lunghezza, larghezza (length, width); lower side: cons_strada (km per liter)
round(res.pca$var$coord, 3)
Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
prezzo 0.445 -0.087 0.379 0.434 -0.608
cilindrata 0.659 0.403 0.058 -0.389 -0.333
cavalli 0.849 -0.399 0.018 0.162 0.172
lunghezza -0.111 0.542 0.213 0.743 0.152
larghezza 0.334 0.742 0.177 -0.068 0.486
peso 0.652 -0.333 0.284 0.110 0.418
velocita 0.923 0.001 0.144 0.012 -0.077
cons_strada -0.191 -0.805 0.425 -0.162 0.178
cons_urbano -0.091 0.315 0.709 -0.574 -0.035
affidabilita -0.620 -0.025 0.693 0.219 -0.044
plot.PCA(res.pca, choix = "var")
It is possible to evaluate the correlation between MVs, by using the angles between the vectors (arrows):
The table reported below is the **Table of \(\cos^2\)**, which gives information about the quality of the graphical representation of the MVs. The interpretation is made as follows:
For instance, in the case study, cavalli has Medium-High Quality (\(\cos^2 = 0.721\)) and velocita has High Quality, since \(\cos^2 = 0.852\).
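The \(\cos^2\) is simply the squared coordinate, and for standardized data the \(\cos^2\) of a variable summed over all dimensions equals 1 (the variable is perfectly represented in the full space). A sketch with base R's prcomp:

```r
X <- scale(iris[, 1:4])
p <- prcomp(X)
coord <- p$rotation %*% diag(p$sdev)   # variable coordinates (= correlations)
cos2 <- coord^2                        # quality of representation per dimension
round(cos2[, 1:2], 3)                  # cos2 on the first two dimensions
rowSums(cos2)                          # each variable sums to 1 over all dims
```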
round(res.pca$var$cos2, 3)
Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
prezzo 0.198 0.008 0.144 0.189 0.370
cilindrata 0.434 0.163 0.003 0.152 0.111
cavalli 0.721 0.160 0.000 0.026 0.030
lunghezza 0.012 0.294 0.045 0.551 0.023
larghezza 0.112 0.550 0.031 0.005 0.236
peso 0.425 0.111 0.080 0.012 0.175
velocita 0.852 0.000 0.021 0.000 0.006
cons_strada 0.037 0.648 0.181 0.026 0.032
cons_urbano 0.008 0.100 0.503 0.330 0.001
affidabilita 0.385 0.001 0.480 0.048 0.002
The table reported below is called the Table of the Unit Coordinates and gives information about the position of the units (in this case study, the cars) on the plot, based on the meaning (interpretation) given by the Variable Plot. In the Units Plot, the coordinates are not correlations.
round(res.pca$ind$coord, 3)
Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
A155 3.053 -2.037 -0.321 0.467 1.120
AU80 -1.451 -1.774 2.480 1.220 0.532
BMW3 3.181 -0.561 0.016 1.604 -2.214
CXAN 1.072 2.357 -0.602 -0.539 1.164
TEMP -0.056 -0.708 1.190 -0.572 -0.384
MOND 2.894 1.762 0.327 -0.238 1.164
DELT 0.085 -1.886 0.601 -3.228 0.028
DEDR 0.151 -0.798 -0.248 -0.279 0.102
PRIM -1.195 -0.989 -2.298 1.561 1.247
VECT -0.808 1.087 -0.303 -0.521 -1.359
P405 -3.257 -0.803 -0.896 0.395 0.143
RE21 -1.569 1.885 1.533 0.358 0.371
GOLF -0.707 0.484 -1.798 -1.053 -0.527
PASS -0.836 1.685 1.151 0.722 -0.114
VOL4 -0.559 0.297 -0.832 0.103 -1.272
plot.PCA(res.pca, choix = "ind")
The Principal Component Analysis has the following aims and characteristics:
library(FactoMineR)
library(ggplot2)
library(factoextra)
For the PCA we will use the PCA function in the Package FactoMineR.
Y <- mpg
res.pca <- PCA(Y[,c(3,5,8,9)])
summary(res.pca)
Call:
PCA(X = Y[, c(3, 5, 8, 9)])
Eigenvalues
Dim.1 Dim.2 Dim.3 Dim.4
Variance 3.509 0.379 0.071 0.041
% of var. 87.736 9.477 1.772 1.014
Cumulative % of var. 87.736 97.214 98.986 100.000
Individuals (the 10 first)
Dist Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3 ctr cos2
1 | 2.002 | 1.832 0.408 0.837 | -0.600 0.406 0.090 | -0.187 0.212 0.009 |
2 | 2.211 | 2.190 0.584 0.981 | -0.289 0.094 0.017 | -0.087 0.046 0.002 |
3 | 2.202 | 2.160 0.568 0.963 | -0.129 0.019 0.003 | -0.056 0.019 0.001 |
4 | 2.203 | 2.196 0.587 0.994 | -0.118 0.016 0.003 | 0.000 0.000 0.000 |
5 | 0.709 | 0.336 0.014 0.225 | -0.077 0.007 0.012 | -0.498 1.493 0.493 |
6 | 0.731 | 0.575 0.040 0.619 | 0.130 0.019 0.032 | -0.431 1.119 0.348 |
7 | 0.720 | 0.543 0.036 0.568 | 0.340 0.130 0.222 | -0.290 0.506 0.162 |
8 | 1.822 | 1.581 0.304 0.753 | -0.879 0.872 0.233 | -0.122 0.089 0.004 |
9 | 1.781 | 1.258 0.193 0.499 | -1.180 1.569 0.439 | -0.167 0.167 0.009 |
10 | 1.954 | 1.910 0.444 0.955 | -0.408 0.188 0.044 | 0.010 0.001 0.000 |
Variables
Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3 ctr cos2
displ | -0.932 24.773 0.869 | 0.309 25.155 0.095 | 0.187 49.213 0.035 |
cyl | -0.933 24.821 0.871 | 0.307 24.860 0.094 | -0.183 47.082 0.033 |
cty | 0.951 25.754 0.904 | 0.271 19.334 0.073 | 0.038 2.010 0.001 |
hwy | 0.930 24.652 0.865 | 0.341 30.652 0.116 | -0.035 1.694 0.001 |
Z <- mtcars
res.pca1 <- PCA(Z)
Warning: ggrepel: 10 unlabeled data points (too many overlaps). Consider increasing max.overlaps
summary(res.pca1)
Call:
PCA(X = Z)
Eigenvalues
Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7 Dim.8 Dim.9
Variance 6.608 2.650 0.627 0.270 0.223 0.212 0.135 0.123 0.077
% of var. 60.076 24.095 5.702 2.451 2.031 1.924 1.230 1.117 0.700
Cumulative % of var. 60.076 84.172 89.873 92.324 94.356 96.279 97.509 98.626 99.327
Dim.10 Dim.11
Variance 0.052 0.022
% of var. 0.473 0.200
Cumulative % of var. 99.800 100.000
Individuals (the 10 first)
Dist Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3 ctr cos2
Mazda RX4 | 2.234 | -0.657 0.204 0.087 | 1.735 3.551 0.604 | -0.601 1.801 0.072
Mazda RX4 Wag | 2.081 | -0.629 0.187 0.091 | 1.550 2.833 0.555 | -0.382 0.728 0.034
Datsun 710 | 2.987 | -2.779 3.653 0.866 | -0.146 0.025 0.002 | -0.241 0.290 0.007
Hornet 4 Drive | 2.521 | -0.312 0.046 0.015 | -2.363 6.584 0.879 | -0.136 0.092 0.003
Hornet Sportabout | 2.456 | 1.974 1.844 0.646 | -0.754 0.671 0.094 | -1.134 6.412 0.213
Valiant | 3.014 | -0.056 0.001 0.000 | -2.786 9.151 0.855 | 0.164 0.134 0.003
Duster 360 | 3.187 | 3.003 4.264 0.888 | 0.335 0.132 0.011 | -0.363 0.656 0.013
Merc 240D | 2.841 | -2.055 1.998 0.523 | -1.465 2.531 0.266 | 0.944 4.439 0.110
Merc 230 | 3.733 | -2.287 2.474 0.375 | -1.984 4.639 0.282 | 1.797 16.094 0.232
Merc 280 | 1.907 | -0.526 0.131 0.076 | -0.162 0.031 0.007 | 1.493 11.103 0.613
Variables (the 10 first)
Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3 ctr cos2
mpg | -0.932 13.143 0.869 | 0.026 0.026 0.001 | -0.179 5.096 0.032 |
cyl | 0.961 13.981 0.924 | 0.071 0.191 0.005 | -0.139 3.073 0.019 |
disp | 0.946 13.556 0.896 | -0.080 0.243 0.006 | -0.049 0.378 0.002 |
hp | 0.848 10.894 0.720 | 0.405 6.189 0.164 | 0.111 1.960 0.012 |
drat | -0.756 8.653 0.572 | 0.447 7.546 0.200 | 0.128 2.598 0.016 |
wt | 0.890 11.979 0.792 | -0.233 2.046 0.054 | 0.271 11.684 0.073 |
qsec | -0.515 4.018 0.266 | -0.754 21.472 0.569 | 0.319 16.255 0.102 |
vs | -0.788 9.395 0.621 | -0.377 5.366 0.142 | 0.340 18.388 0.115 |
am | -0.604 5.520 0.365 | 0.699 18.440 0.489 | -0.163 4.234 0.027 |
gear | -0.532 4.281 0.283 | 0.753 21.377 0.567 | 0.229 8.397 0.053 |
The eigenvalues measure the level of information retained by the PCA. They show the variance explained from the data and help us select the number of Latent Variables (also called Latent Traits, Principal Components, or Dimensions). The rules to select the number of LVs are: - eigenvalues greater than 1: \(\lambda > 1\) - cumulative percentage of explained variance > 0.7 (70%)
In our case study we select two components: both have an eigenvalue greater than 1, with a cumulative percentage of explained variance equal to 84.172%.
round(res.pca1$eig,3)
eigenvalue percentage of variance cumulative percentage of variance
comp 1 6.608 60.076 60.076
comp 2 2.650 24.095 84.172
comp 3 0.627 5.702 89.873
comp 4 0.270 2.451 92.324
comp 5 0.223 2.031 94.356
comp 6 0.212 1.924 96.279
comp 7 0.135 1.230 97.509
comp 8 0.123 1.117 98.626
comp 9 0.077 0.700 99.327
comp 10 0.052 0.473 99.800
comp 11 0.022 0.200 100.000
Below we can read the table with correlations between the Manifest Variables (MVs) and the Latent Variables (Dimensions). The correlations are also the coordinates of the manifest variables on the variables plot.
For the interpretation:
In our case study, the LVs (dimensions) are defined as follows:
round(res.pca1$var$coord[,1:2],3)
Dim.1 Dim.2
mpg -0.932 0.026
cyl 0.961 0.071
disp 0.946 -0.080
hp 0.848 0.405
drat -0.756 0.447
wt 0.890 -0.233
qsec -0.515 -0.754
vs -0.788 -0.377
am -0.604 0.699
gear -0.532 0.753
carb 0.550 0.673
It is possible to evaluate the correlations between the manifest variables by using the angles between the vectors. For instance, the variables hp, cyl, disp and wt are all positively correlated, since all the angles between them are less than 90°.
plot.PCA(res.pca1, axes = c(1,2), choix = "var")
The quality of the projected manifest variables is given by the Cosine squared.
Below you can find the interpretation intervals:
round(res.pca1$var$cos2[,1:2],3)
Dim.1 Dim.2
mpg 0.869 0.001
cyl 0.924 0.005
disp 0.896 0.006
hp 0.720 0.164
drat 0.572 0.200
wt 0.792 0.054
qsec 0.266 0.569
vs 0.621 0.142
am 0.365 0.489
gear 0.283 0.567
carb 0.303 0.453
In the units plot we can evaluate clusters of units. The clusters define groups of units that are similar on the basis of the estimated Latent Variables.
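One way to make such groups explicit (not part of the course output, and the number of clusters is an illustrative choice) is to run k-means on the first two component scores:

```r
set.seed(1)                            # reproducible cluster assignment
p <- prcomp(scale(mtcars))             # PCA on the standardized mtcars data
km <- kmeans(p$x[, 1:2], centers = 3)  # 3 clusters on the (Dim.1, Dim.2) scores
table(km$cluster)                      # cluster sizes
```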
plot.PCA(res.pca1, axes = c(1,2), choix = "ind")
Warning: ggrepel: 10 unlabeled data points (too many overlaps). Consider increasing max.overlaps
library(factoextra)
fviz_pca_var(res.pca1)
fviz_pca_ind(res.pca1)
A biplot is a plot where both units and variables are represented.
fviz_pca_biplot(res.pca1)
fviz_screeplot(res.pca1)
fviz_pca_contrib(res.pca1, choice = "var")
Warning in fviz_pca_contrib(res.pca1, choice = "var") :
The function fviz_pca_contrib() is deprecated. Please use the function fviz_contrib() which can handle outputs of PCA, CA and MCA functions.